Concepts of Basic Quantitative Data Analysis

Yebelay Berehan

2025-12-05

Training curriculum

Training Outline

5 modules · 12 topics

What is Statistics?

Definition & scope : role of statistics in research and decision-making
Types of data : categorical, continuous, ordinal, discrete variables

Branches of Statistics

Descriptive statistics : summarizing and visualizing data distributions
Inferential statistics : drawing conclusions from samples to populations

Hypothesis Testing & Statistical Errors

T-Tests & ANOVA : comparing means across two or more groups
Confidence intervals vs P-values : interpreting uncertainty & significance
Type I & Type II errors : false positives, false negatives, and power

Measures of Association

Chi-Square Test : independence between categorical variables
Pearson's Correlation : strength and direction of linear relationships

Introduction to Statistical Models (Advanced)

Regression analysis : linear & logistic regression, model interpretation
Survival data analysis : Kaplan-Meier curves, Cox proportional hazards
Longitudinal data analysis : mixed models for repeated measures

Goal of the training

The goal of this training is to:

Provide an overview of fundamental statistical methods for quantitative data analysis, including the key theoretical results behind them.
- Equip staff with essential knowledge and skills to interpret data effectively.
- Apply statistical methods that support organizational goals.
Introduce Advanced Topics in Quantitative Data Analysis:
- Regression analysis, Survival Analysis, Longitudinal data analysis
Enable trainees to pose scientific questions within the context of appropriate statistical methods and carry out and interpret analyses effectively.

What is statistics and quantitative data analysis?

Statistics is the science of:
- Collecting, analyzing, summarizing, and interpreting data
- Drawing conclusions to support effective decision-making.
Transforms raw data into meaningful information.
Helps us learn about the world through data-driven insights.
Widely applied in fields like:
- Biology, sociology, economics, public health, and business
When applied to biological or health-related data, it is referred to as Biostatistics

BRANCHES OF STATISTICS

Branches of Statistics

The field of statistics consists of two major branches that help us describe data and make informed decisions based on evidence.

Purpose of Descriptive Statistics

Describe the research sample
- Understand background characteristics of study participants
- Ensures proper interpretation of findings
Understand the data
- Summarizes main study variables before inferential analysis
- Supports answering research questions in descriptive studies
- Aids in data cleaning and outlier detection
Check assumptions for inferential statistics
- Example: Histograms assess normality in univariate/multivariable analysis
Present descriptive results appropriately
- Choice of presentation method depends on variable types
- Understanding variable types is essential for accurate summary

Types of variables

What is a Variable?

A variable is any characteristic, number, or quantity that can be measured or counted and differs across individuals or observations.

Two Main Types of Variables

a). Categorical Variables (Qualitative)

Describe groups or categories.
- Examples: Sex, Region, Marital Status, Education Level.

b). Numerical Variables (Quantitative)

Represent numbers and quantities.
- Examples: Age, Income, Height, Number of Children.

MEASURE OF CENTRAL TENDENCY

Measure of central tendency

Refers to a statistical measure that identifies a single value representing the entire distribution.
Aims to provide an accurate summary of the overall data.
Represents the most typical or central value in a dataset.
Known as the “number crunching” part of data description.
Three common measures:
- Mean – the average of all values
- Median – the middle value when data is ordered
- Mode – the most frequently occurring value

Note

Covers statistical methods for describing data using statistical characteristics, charts, graphics or tables.

Mean

Also known as the “average” in everyday language.
Applicable only to numeric variables (interval or ratio scale).
Represents the center of gravity of a distribution.
- Sum of all values ÷ Number of values (n)

Provides a balanced summary of the dataset.

Median

The middle value in an ordered dataset (from smallest to largest).
Represents the 50th percentile: half the values lie above it and half below.
Not affected by extreme values (outliers) → ideal for skewed distributions

If n is odd: Median = value at position

\[\left[\frac{(n+1)}{2}\right]^{th} \text{ value}\]

If n is even:

\[\text{average of } \left(\frac{n}{2}\right)^{th} \text{ and } \left[\left(\frac{n}{2}\right)+1\right]^{th} \text{ values}\]

Example in calculating Median
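
As a minimal sketch of the two position rules above, the following Python function picks the middle value for odd n and averages the two middle values for even n. The function name and the age values are hypothetical and serve only as an illustration.

```python
def median_by_position(values):
    """Median via the position rules: the (n+1)/2-th value for odd n,
    the average of the n/2-th and (n/2 + 1)-th values for even n."""
    ordered = sorted(values)   # the rules apply to ordered data
    n = len(ordered)
    if n % 2 == 1:
        # odd n: 1-based position (n + 1) / 2 is 0-based index (n - 1) // 2
        return ordered[(n - 1) // 2]
    # even n: 1-based positions n/2 and n/2 + 1
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2


ages_odd = [23, 19, 31, 27, 25]        # hypothetical data, n = 5
ages_even = [23, 19, 31, 27, 25, 35]   # hypothetical data, n = 6

print(median_by_position(ages_odd))    # 25
print(median_by_position(ages_even))   # 26.0
```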

Mean vs Median

The median is much more robust against outliers than the mean.
An outlier usually has little or no influence on the median, but it can have a large influence on the mean.
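
A small illustration of this robustness, using made-up values: replacing one observation with an extreme value shifts the mean considerably while the median stays where it was.

```python
import statistics

values = [10, 12, 13, 15, 16]               # hypothetical data
values_with_outlier = [10, 12, 13, 15, 100]  # largest value replaced by an outlier

mean_before = statistics.mean(values)
median_before = statistics.median(values)
mean_after = statistics.mean(values_with_outlier)
median_after = statistics.median(values_with_outlier)

print("mean:  ", mean_before, "->", mean_after)      # 13.2 -> 30
print("median:", median_before, "->", median_after)  # 13 -> 13
```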

Mean vs Median (continued)

When the data is symmetric and has no outliers, the mean is the best measure because it uses all values and reflects the overall dataset accurately.
However, if the quantitative data is skewed or has outliers, the median is a better choice.
Quantitative data are often presented with histograms or with summary tables showing the mean and standard deviation.

MEASURE OF DISPERSION

Measure of Dispersion

Dispersion (variability) describes how spread out the values in a distribution are from one another.
The most widely used measures of dispersion are:
- Variance and standard deviation
- Interquartile range and range

Variance

Variance measures the spread of the observations around the mean.
The sample variance (s²) is the sum of the squared deviations from the sample mean, divided by n − 1.

\[s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2\]

Standard deviation

The standard deviation is a statistical metric that quantifies the dispersion or variability of data points relative to their mean.
It reflects how far on average, each individual value deviates from the mean, offering insight into the spread of the data.
A low standard deviation means that the values are close to the mean.
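The population (σ) and sample (s) standard deviation formulas are:

\[\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2} \qquad s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}\]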

Example of variance and standard deviation calculation

\[s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2\]
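
The calculation can be followed step by step in a short Python sketch. The scores are hypothetical; the steps mirror the sample variance formula above, and the standard deviation is taken as its square root.

```python
import math

scores = [4, 8, 6, 5, 7]                      # hypothetical data
n = len(scores)

mean = sum(scores) / n                        # x-bar = 6.0
squared_deviations = [(x - mean) ** 2 for x in scores]   # (x_i - x-bar)^2
sample_variance = sum(squared_deviations) / (n - 1)      # s^2, divides by n - 1
sample_sd = math.sqrt(sample_variance)                   # s

print(squared_deviations)          # [4.0, 4.0, 0.0, 1.0, 1.0]
print(sample_variance)             # 2.5
print(round(sample_sd, 3))         # 1.581
```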

Interquartile range (IQR)

The IQR is the distance between the 1st and 3rd quartiles.
It is not sensitive to extreme values (outliers). Thus, it is usually reported together with the median for skewed distributions.

\[IQR = q_3 - q_1, \quad \text{where } q_1 \text{ and } q_3 \text{ are the 1st and 3rd quartiles}\]
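
A hedged sketch of the IQR calculation in Python. Quartile conventions differ slightly between textbooks and software; this example assumes the default ("exclusive") method of `statistics.quantiles`, and the data are hypothetical.

```python
import statistics

data = [2, 4, 4, 5, 7, 9, 10, 12]          # hypothetical data

# quantiles(..., n=4) returns the three cut points q1, q2 (median), q3
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print(q1, q2, q3)   # 4.0 6.0 9.75
print(iqr)          # 5.75
```

Other software may report slightly different quartiles for the same data, so the method used should be stated when results are compared.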

ORGANIZATION AND PRESENTATION OF DATA (TABLES & GRAPHS)

Tabular Presentation

Note

Statistical presentation: frequency table. A frequency table is effective for presenting large amounts of data. Tables should have clear titles, row and column labels, and units of measurement.

Charts

In charts, data is represented graphically.
- Charts are used in statistics mainly to get an overview of the collected data and to present information in an easily understandable way,
  - i.e., to display data patterns, relationships, and trends.
The most commonly used charts in statistics are bar charts, histograms, scatter plots, line plots, box plots, and pie charts.

Bar chart

Bar charts are one of the most common charts in scientific work and they are mostly used to show either frequencies or averages.
Depending on how the bars are arranged, a distinction is made between horizontal and vertical bar charts.

Bar chart (continued)

Due to the simplicity of bar charts, they are often created in descriptive statistics.
They provide a very quick overview of the ranking and frequencies of characteristic values.
A bar chart shows absolute or relative frequencies on a two-axis coordinate system.

Grouped bar charts

If two categorical variables are present, grouped bar charts can be created.
In grouped bar charts, either the frequency, the percent, or the percent in each group can be specified.

Overall, bar chart is

Histogram

A histogram is a graphical representation of the frequency distribution of a numeric variable.
To display a distribution of data in a histogram, the data must first be divided into classes, also called bins.
These classes or bins are then represented by rectangles that lie directly next to each other.

Used for normality checking

Polygon

A frequency polygon is a graph that displays the data using lines that connect the points plotted for the frequencies.

Box plot

Boxplots provide a compact visual summary of data distributions.
Key features displayed include:
- Median (central value), Interquartile Range (IQR) (middle 50% of the data), Outliers (unusual values outside the typical range)
Ideal for continuous data (e.g., age, income, temperature).

Note

Commonly used to compare multiple groups.

Example: comparing age distributions across different population groups.

How is a boxplot interpreted?

The box itself indicates the range in which the middle 50% of all values lie.
Thus, the lower end of the box is the 1st quartile, and the upper end is the 3rd quartile.

Boxplot illustration

Scatter Plots

Used to visualize correlations between two variables.
Each data point represents a pair of values in a coordinate system.
Example: Plotting height vs. weight for individuals.

Note

What can you say about this graph?

Scatter Plots (continued)

Helps identify the type of correlation:
Positive correlation: Both variables increase together.
Negative correlation: One variable increases while the other decreases.
No correlation: Points appear randomly scattered.

Note

Can also show non-linear relationships where data follows a pattern but not a straight line.

Good practice in data presentation: tabular result

When presenting descriptive statistics in tables, it is good practice to keep the table simple, organized, and easy to read.
Use clear headings for each variable, and label columns properly (e.g., mean, median, SD).
Align numbers neatly, usually to the right, and round them to a consistent number of decimal places.
Always add a clear table title and footnotes if any explanations or abbreviations are used, so the reader can understand the table without extra help.

Tabular presentation or reporting

Graphical data presentation or reporting

Choose simple, clear, and appropriate graphs that match the type of data.
For categorical data, use bar charts or pie charts to show frequencies or proportions.
For quantitative data, use histograms for distribution, box plots for medians and spread, and scatter plots to show relationships between two variables.

Note

Always label the axes clearly, include units where needed, and use a title that explains what the graph shows.

INFERENTIAL STATISTICS

Inferential Statistics

Involves drawing conclusions about a population from a random sample: an educated guess about the population rather than a description of the data at hand.
Useful when it is impractical to study the entire population.
Allows generalization from sample data to a larger group.
Example:
Evaluating the effect of a cash transfer program by surveying a sample of recipients and inferring the results to the broader population.
Key methods (the choice depends on data type, assumptions, and objective):
Group comparison tests:
- t-test, chi-square test, ANOVA (Analysis of Variance)
Relationship/correlation tests:
- Correlation analysis, regression analysis

Random Sample and Sampling Error

A random and representative sample approximates the population, but is rarely identical to it.
The difference between a sample statistic and the true population value is called sampling error.
Sampling error is:
- Natural and expected
- Unavoidable when using samples instead of full populations

Random Sample and Sampling Error (continued)

Since population values are unknown, the exact sampling error is also unknown.
We use statistical methods to estimate sampling error and evaluate reliability.

Role of Hypothesis Testing:

Helps determine whether sample results reflect true population effects or are due to random chance.
Accounts for expected variability caused by sampling error.

Hypothesis testing

A hypothesis is an assumption about a relationship or effect that is neither proven nor disproven at the start of a study.
It is developed based on a research question and typically justified through a literature review.
- A hypothesis proposes an expected association (e.g., “Men earn more than women in the same job in Ethiopia”).
- The goal is to test the hypothesis against data; formally, the test checks whether the opposite statement (the null hypothesis) can be rejected.
- Data from surveys or experiments are used, and a hypothesis test (e.g., t-test, correlation analysis) is applied.

Steps to formulate and test a hypothesis

Define a clear research question.
Formulate a specific hypothesis about the population.
Collect relevant data.
Select and apply an appropriate test.

Note

Hypotheses are not simple statements; they are formulated in such a way that they can be tested with collected data in the course of the research process.

Null and alternative hypothesis

In hypothesis testing, we always define two opposing hypotheses:
Null hypothesis (H₀): Assumes no difference or no effect between groups.
- Example: The salaries of men and women in Ethiopia do not differ.
Alternative hypothesis (H₁): Assumes a difference or an effect between groups.
- Example: The salaries of men and women in Ethiopia differ.
The hypothesis you want to support (based on theory or prior research) is usually the alternative hypothesis.
In a hypothesis test, you only test the null hypothesis and decide whether to reject it.

Level of significance or probability of error

In hypothesis testing, we can never be 100% sure when rejecting the null hypothesis; there is always a small chance we are wrong.
The significance level (α) is the allowed probability of wrongly rejecting a true null hypothesis.
- If the p-value is smaller than α, we reject the null hypothesis.
- If the p-value is greater than or equal to α, we do not reject the null hypothesis.
The most common significance level is 5% (α = 0.05) → 5% risk of wrongly rejecting a true null hypothesis.

Example: Two-Sample t-Test

Used to compare the means of two independent groups.
A larger difference between sample means suggests it’s less likely both groups come from the same population.

Key Concepts:

If p-value < significance level (α) → Reject H₀ (null hypothesis)
- Example: p = 0.04 < 0.05 → if there were truly no difference, a mean difference at least as extreme as the one observed would occur in only 4% of samples.
The significance level (α):
- Must be set before analysis (commonly 0.05)
- Must not be adjusted afterward to influence results

Types of Errors in Hypothesis Testing

Hypothesis testing is based on sample data, so errors can occur due to random variation. No test is 100% foolproof - sample results naturally vary by chance.

Two Main Types of Errors:

Type I Error (α):
- Rejecting the null hypothesis when it is actually true
- False positive → concluding there is an effect when there isn’t
Type II Error (β):
- Failing to reject the null hypothesis when the alternative is true
- False negative → missing a real effect that exists

Type I Error (\(\alpha\)) - False Positive

Occurs when the null hypothesis is rejected even though it is actually true.
Also called a false positive → concluding there is an effect when there isn’t one.
The probability of committing a Type I error is the significance level \(\alpha\), set by the researcher (commonly \(\alpha = 0.05\)).

\[P(\text{Type I Error}) = \alpha\]

Example: A drug trial concludes the drug is effective, but in reality it has no effect. The result was due to random chance in the sample.

Type II Error (\(\beta\)) - False Negative

Occurs when the null hypothesis is not rejected even though the alternative hypothesis is true.
Also called a false negative → missing a real effect that actually exists.
The probability of committing a Type II error is denoted \(\beta\).

\[P(\text{Type II Error}) = \beta\]

Example: A drug trial concludes the drug has no effect, but it actually does work. The study failed to detect the real effect.

Statistical Power

The power of a test is the probability of correctly rejecting \(H_0\) when it is false (the ability to detect a true effect).

\[\text{Power} = 1 - \beta\]

The Trade-off Between Type I and Type II Errors

Decreasing \(\alpha\) (stricter threshold) → reduces Type I errors, but increases Type II errors (\(\beta\)).
Increasing \(\alpha\) (lenient threshold) → reduces Type II errors, but increases Type I errors.
The most direct way to reduce both simultaneously is to increase the sample size \(n\).
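
The trade-off can be illustrated numerically. The sketch below assumes a one-sided, one-sample z-test with a known standard deviation, which is a simpler setting than the t-tests covered in this training; the function name, effect size, and sample sizes are hypothetical.

```python
from statistics import NormalDist


def power_one_sided_z(effect, sigma, n, alpha):
    """Power of a one-sided z-test of H0: mu = mu0 against H1: mu > mu0.

    effect is the true difference mu - mu0; sigma is assumed known.
    """
    std_normal = NormalDist()
    z_critical = std_normal.inv_cdf(1 - alpha)     # rejection threshold
    shift = effect * n ** 0.5 / sigma              # standardized true effect
    return 1 - std_normal.cdf(z_critical - shift)  # P(reject H0 | H1 true)


# Stricter alpha -> lower power (higher beta) at the same sample size
power_alpha_05 = power_one_sided_z(effect=2, sigma=10, n=50, alpha=0.05)
power_alpha_01 = power_one_sided_z(effect=2, sigma=10, n=50, alpha=0.01)

# Larger n -> higher power even at the stricter alpha
power_large_n = power_one_sided_z(effect=2, sigma=10, n=200, alpha=0.01)

for label, power in [("alpha=0.05, n=50", power_alpha_05),
                     ("alpha=0.01, n=50", power_alpha_01),
                     ("alpha=0.01, n=200", power_large_n)]:
    print(f"{label}: power = {power:.3f}, beta = {1 - power:.3f}")
```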

Why Do Errors Happen?

Sample results vary by chance
- no sample perfectly represents the population.
No test is 100% foolproof
- a decision based on probabilities will occasionally be wrong.
The goal is not to eliminate errors entirely, but to control and minimise them through
- appropriate study design,
- sample size, and
- significance thresholds.

Choosing the Right Statistical Test

Selecting the appropriate test depends on three key factors:

Type of variables - categorical (nominal/ordinal) or numeric (continuous/discrete)
Number of groups or samples - one, two, or more groups
Relationship between samples - independent or related (paired)

Common Statistical Tests

Test	Purpose	Variable Type
T-test	Compares mean differences	Numeric (DV), Categorical (IV)
Chi-square	Tests association between categorical variables	Categorical
One-way ANOVA	Compares means across 3+ independent groups	Numeric (DV), Categorical (IV)
Two-way ANOVA	Examines two factors and their interaction effects	Numeric (DV), Two categorical (IV)

a) T-test

Used to compare mean differences between groups. Three variants:

Type	When to Use	Example
One-sample t-test	Compare a sample mean to a known population value	Is the average exam score different from 70?
Independent two-sample t-test	Compare means of two unrelated groups	Do males and females differ in weight?
Paired-sample t-test	Compare means within the same group at two time points	Did weight change after an intervention?

\(t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} \quad \text{(one-sample)}\)
\(t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s^2_p\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \quad \text{(independent two-sample)}\)
\(t = \frac{\bar{d}}{s_d / \sqrt{n}} \quad \text{(paired)}\)
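
The three test statistics above can be computed directly, as in the following sketch. It assumes the usual pooled variance \(s^2_p = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}\) for the independent two-sample case; the function names and data are hypothetical, and the code returns only the t statistic, not the p-value.

```python
import math
import statistics


def one_sample_t(sample, mu0):
    """t = (x-bar - mu0) / (s / sqrt(n))"""
    n = len(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    return (statistics.mean(sample) - mu0) / standard_error


def independent_t(group1, group2):
    """Two-sample t with pooled variance (assumes equal variances)."""
    n1, n2 = len(group1), len(group2)
    pooled_variance = (
        (n1 - 1) * statistics.variance(group1)
        + (n2 - 1) * statistics.variance(group2)
    ) / (n1 + n2 - 2)
    mean_difference = statistics.mean(group1) - statistics.mean(group2)
    return mean_difference / math.sqrt(pooled_variance * (1 / n1 + 1 / n2))


def paired_t(before, after):
    """Paired t: a one-sample t-test on the differences against 0."""
    differences = [a - b for b, a in zip(before, after)]
    return one_sample_t(differences, 0)


# Hypothetical data
t_one = one_sample_t([4, 8, 6, 5, 7], mu0=5)
t_two = independent_t([1, 2, 3], [4, 5, 6])
t_paired = paired_t(before=[80, 75, 90, 85], after=[78, 74, 86, 83])

print(round(t_one, 3), round(t_two, 3), round(t_paired, 3))
```

To reach a decision, each statistic would be compared with the t-distribution with n − 1 (one-sample and paired) or n₁ + n₂ − 2 (independent) degrees of freedom, using statistical software or a table.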

b) Chi-Square Test

Tests the association between two categorical variables.

\[\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}\]

\(O_i\) = observed frequency, \(E_i\) = expected frequency
Used for nominal or ordinal data - never for continuous variables
See the Chi-Square Test notes for full breakdown.

c) One-way / Two-way ANOVA

One-way ANOVA: Compares means across three or more independent groups on one factor.
Two-way ANOVA: Examines two independent factors simultaneously, including their interaction effect on the outcome.

\[F = \frac{\text{Variance between groups}}{\text{Variance within groups}}\]

If \(F\) is large, the between-group differences are unlikely due to chance alone.
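
A minimal sketch of the F ratio for a one-way ANOVA, with hypothetical data for three groups. "Variance between groups" is taken as the between-group mean square and "variance within groups" as the within-group mean square.

```python
groups = [
    [1, 2, 3],   # hypothetical group 1
    [2, 3, 4],   # hypothetical group 2
    [5, 6, 7],   # hypothetical group 3
]

all_values = [x for group in groups for x in group]
grand_mean = sum(all_values) / len(all_values)
group_means = [sum(group) / len(group) for group in groups]

# Between-group sum of squares: group size times squared distance of each
# group mean from the grand mean
ss_between = sum(
    len(group) * (m - grand_mean) ** 2 for group, m in zip(groups, group_means)
)
# Within-group sum of squares: squared distance of each value from its group mean
ss_within = sum(
    (x - m) ** 2 for group, m in zip(groups, group_means) for x in group
)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

f_ratio = (ss_between / df_between) / (ss_within / df_within)
print(round(f_ratio, 3))   # 13.0
```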

Checking Assumptions of Statistical Tests

Most statistical tests require certain assumptions to be valid.
Assumptions should always be checked before running tests - violating them can lead to invalid conclusions.

How to Check Assumptions

Method	Approach
Numerically	Compare mean vs. median; assess skewness and kurtosis
Statistically	Use formal normality tests (Shapiro-Wilk, Kolmogorov-Smirnov)
Graphically	Use boxplots, histograms, Q-Q plots

Normality Assumption

Many tests including the t-test and ANOVA assume that the data (or residuals) follow a normal distribution.

Graphical Checks

Histogram

Visually inspect the shape of the distribution.
A bell-shaped, symmetric histogram suggests normality.
Skewed or multi-modal shapes suggest non-normality.

Q-Q Plot (Quantile-Quantile Plot)

Plots observed quantiles against theoretically expected quantiles under normality.
If points fall along the diagonal line → normality is supported.
Systematic deviations from the line → normality is violated.

Formal Normality Tests

Test	Best Used When	Null Hypothesis
Shapiro-Wilk	Small to medium samples (\(n < 50\))	Data are normally distributed
Kolmogorov-Smirnov	Larger samples	Data follow a specified distribution
Lilliefors correction	When population parameters are unknown	Data are normally distributed

Interpretation: If \(p < 0.05\), reject \(H_0\) → the data significantly deviate from normality.

What to Do When Normality Is Violated

Situation	Recommended Action
Mild violation, large \(n\)	Parametric tests are still robust (Central Limit Theorem applies)
Small sample, clear skew	Use non-parametric alternatives
T-test violated	Use Mann-Whitney U (independent) or Wilcoxon (paired)
ANOVA violated	Use Kruskal-Wallis test
Chi-square violated	Use Fisher’s Exact Test

Confidence Interval

A CI defines a range where the true population parameter (e.g., mean) is likely to lie.
Sample estimates (mean, variance) are only approximations of the true population values.
Based on the sample mean (x̄), the sample size (n), and the sample standard deviation (s), and assuming a normal sampling distribution of the estimate, the CI is given as:

\[CI = \bar{x} \pm z \cdot \frac{s}{\sqrt{n}}\]

Confidence Interval (continued)

If the sample is small, the t-distribution is used instead of the normal distribution.
Then the z value is replaced by t and the formula is: \[CI = \bar{x} \pm t \cdot \frac{s}{\sqrt{n}}\]
A 95% confidence interval means that, if the study were repeated many times, about 95% of the intervals computed in this way would contain the true value of the parameter.
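
As an illustration, the sketch below computes a z-based 95% interval, x̄ ± z · s/√n, from hypothetical summary statistics. For small samples the z value would be replaced by the corresponding t quantile with n − 1 degrees of freedom, which is not available in the Python standard library and would come from a table or statistical software.

```python
from statistics import NormalDist

# Hypothetical summary statistics
sample_mean = 50.0
sample_sd = 10.0
n = 100
confidence = 0.95

# Two-sided critical value: 2.5% in each tail for a 95% interval
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # about 1.96
margin = z * sample_sd / n ** 0.5

ci_lower = sample_mean - margin
ci_upper = sample_mean + margin
print(f"95% CI: ({ci_lower:.2f}, {ci_upper:.2f})")   # 95% CI: (48.04, 51.96)
```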

Statistical tests for differences

One-Sample t-Test
Purpose: Tests if the sample mean differs significantly from a known or hypothesized population mean.
Used when: You have one sample and a fixed reference value (e.g., target, population mean).

One-Sample t-Test (continued)

Requirements
Random sample and approx. normal distribution of the data
Numeric (continuous) data

Types of Questions

Two-tailed: Is the sample mean different from the reference value?
One-tailed: Is the sample mean greater than or less than the reference value?

Hypotheses

H₀ (Null): μ = μ₀ (the population mean equals the reference value)
H₁ (Alternative): μ ≠ μ₀ (the population mean differs from the reference value)

Example: for a pre-specified average μ₀ = 28

Note

A t-test showed a statistically significant difference between the scores of students who attended the online course and the average score of students who did not attend an online course, t(11) = 2.75, p < 0.02, α = 0.05.

Two-Sample t-Test (Independent Samples)

Purpose: Tests if two independent groups differ significantly in their means.
Used when: Comparing two unrelated groups (e.g., treatment vs placebo, male vs female).
Requirements
Two independent samples & Numerical (continuous) data
Normal distribution in each group (or large enough samples for approximation)
Homogeneity of variances (can be tested with Levene’s test)

Examples

Does Drug XY reduce weight compared to a placebo?
Is there a health difference between people with and without a degree?

Hypotheses

H₀ (Null): μ₁ = μ₂ (No difference between group means)
H₁ (Alternative): μ₁ ≠ μ₂ (There is a difference between group means)

Assumptions of the Independent t-test

When performing an independent t-test, the following assumptions must be met:

Independence of Groups: The two samples must be independent.
- A value in one group must not influence a value in the other group.
Dependent Variable Must Be Numeric
Normal Distribution: The data in each group should be approximately normally distributed.
Homogeneity of Variance: The variances of the two groups should be similar.
- This is often tested using: Levene’s Test for Equality of Variances.
- If variances are unequal, a corrected version of the t-test (Welch’s t-test) should be used.

What is your conclusion?

Conclusion of the independent t-test

In this example, the p-value is 0.31 (31%), which is higher than the 5% significance level, so there is no statistically significant difference between the two groups.
The 95% confidence interval (-6.328 to 18.118) includes zero, which also indicates no significant difference.

Report a t-test for independent samples:

An independent-samples t-test was conducted to compare exam results in summer and winter. There was no statistically significant difference in the scores (p = 0.31). The mean difference was 5.9 (95% CI: [-6.33, 18.12]); because the interval includes zero, the data are compatible with no difference.

Paired-Samples t-test (Dependent t-test)

The paired-samples t-test is used to compare two related groups to determine whether their mean difference is statistically significant.
Pairs of observations are needed:
- Repeated measurements on the same individuals (before vs after treatment).
- Matched subjects across two groups (e.g., twins, matched cases).
Controls for individual variability because comparisons are within the same subjects.
Greater chance to detect a real difference if one exists.

Look at the before and after weight measurements

ANOVA (Analysis of Variance)

ANOVA is used to test whether statistically significant differences exist between three or more group means.
Why not use multiple t-tests?
- Every hypothesis test carries a risk of Type I error (commonly set at 5%).
- Running multiple t-tests increases the probability of making at least one false-positive conclusion.
- ANOVA helps to avoid inflated error rates when comparing several groups simultaneously.
Independent variables (factors): Categorical.
Dependent variable: Continuous variable
F-ratio: The test statistic in ANOVA.
- It compares between-group variability to within-group variability.

Types of ANOVA

The most commonly used ANOVA designs for non-repeated measures are:

One-factor ANOVA

Does a person’s place of residence (independent variable) influence his or her salary?

Two-factor ANOVA

Does a person’s place of residence (1st independent variable) and gender (2nd independent variable) affect his or her salary?

Example for One-factor ANOVA

The independent variable, e.g. “highest educational qualification” with the three categories group 1, group 2 and group 3, should explain as much of the variance of the dependent variable “salary” as possible.

Note

Accordingly, in case A) the groups have a very high influence on the salary and in case B) they do not.

ANOVA hypotheses

Null hypothesis H₀: The means of all groups are equal.
Alternative hypothesis H₁: At least one group mean differs from the others.
Example: You want to check whether there is a difference in coffee consumption between students in different subjects. To do this, you ask 10 students from each field of study.

Two-Way ANOVA (Two-Factor ANOVA)

Two-way ANOVA tests whether two independent categorical variables (factors) influence a continuous dependent variable.
It also tests whether there is an interaction effect between the two factors.
You want to test the effects of two factors on one dependent variable.
Example questions:
- Do gender and education affect salary?
- Do therapy type and gender affect blood pressure?

Two-Way ANOVA (continued)

Three Questions Answered by Two-Way ANOVA

Does Factor 1 (e.g., gender) affect the dependent variable?
Does Factor 2 (e.g., education) affect the dependent variable?
Is there an interaction effect between Factor 1 and Factor 2?

Hypothesis:

Steps in Two-Way ANOVA Calculation

Calculate group means (e.g., mean attitude scores for male & studied, male & not studied, etc.).
Calculate overall mean of all values.
Calculate Sums of Squares:
- SStotal (total variation from overall mean)
- SSfactorA (variation explained by Factor 1)
- SSfactorB (variation explained by Factor 2)
- SSinteraction (variation explained by interaction of A and B)
- SSerror (unexplained variation)
Calculate degrees of freedom for each component.
Calculate mean squares (variance estimates).
Calculate F-values:
- F = Variance of Factor / Error Variance
Interpret the p-values to decide whether to reject or fail to reject each null hypothesis.
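
The calculation steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the training's own worked example: the design is balanced (same number of observations in every cell), the scores are hypothetical, and p-values are left to statistical software.

```python
from statistics import mean

# Hypothetical attitude scores (1-10); balanced design, 3 people per cell
cells = {
    ("male", "studied"): [6, 7, 8],
    ("male", "not studied"): [5, 6, 7],
    ("female", "studied"): [7, 8, 9],
    ("female", "not studied"): [4, 5, 6],
}
levels_a = sorted({a for a, _ in cells})   # Factor A: gender
levels_b = sorted({b for _, b in cells})   # Factor B: education status
n = 3                                      # observations per cell

# Steps 1-2: cell means, factor-level means, overall mean
all_values = [x for values in cells.values() for x in values]
grand_mean = mean(all_values)
cell_mean = {key: mean(values) for key, values in cells.items()}
mean_a = {a: mean([x for (ka, _), v in cells.items() if ka == a for x in v])
          for a in levels_a}
mean_b = {b: mean([x for (_, kb), v in cells.items() if kb == b for x in v])
          for b in levels_b}

# Step 3: sums of squares
ss_total = sum((x - grand_mean) ** 2 for x in all_values)
ss_a = n * len(levels_b) * sum((mean_a[a] - grand_mean) ** 2 for a in levels_a)
ss_b = n * len(levels_a) * sum((mean_b[b] - grand_mean) ** 2 for b in levels_b)
ss_ab = n * sum(
    (cell_mean[(a, b)] - mean_a[a] - mean_b[b] + grand_mean) ** 2
    for a in levels_a for b in levels_b
)
ss_error = sum((x - cell_mean[key]) ** 2
               for key, values in cells.items() for x in values)

# Step 4: degrees of freedom
df_a = len(levels_a) - 1
df_b = len(levels_b) - 1
df_ab = df_a * df_b
df_error = len(levels_a) * len(levels_b) * (n - 1)

# Steps 5-6: mean squares and F-values (factor variance / error variance)
ms_error = ss_error / df_error
f_a = (ss_a / df_a) / ms_error
f_b = (ss_b / df_b) / ms_error
f_ab = (ss_ab / df_ab) / ms_error

print(f"SS: A={ss_a}, B={ss_b}, AxB={ss_ab}, error={ss_error}, total={ss_total}")
print(f"F:  A={f_a:.2f}, B={f_b:.2f}, AxB={f_ab:.2f}")
```

A useful sanity check in a balanced design is that the four component sums of squares add up to the total sum of squares.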

Two-Way ANOVA example

Example: Do gender (male or female) and education status (studied or not) influence a person’s attitude towards retirement planning?
Dependent Variable: Attitude towards retirement planning (rated from 1 = not important to 10 = very important).

Note

All p-values are greater than 0.05: there is no significant effect of Study (education status), Gender, or their interaction on attitude towards retirement planning.

Chi-Square Test

The Chi-square test is a statistical hypothesis test used for categorical variables (i.e., nominal or ordinal scales, e.g., types of assistance received, satisfaction levels).

It is used to determine whether there is a significant association between two or more categorical variables.

The Chi-square test can answer three types of questions:

Type	Question
Test of Independence	Are two categorical variables independent of each other?
Test of Goodness-of-Fit	Do observed frequencies match an expected distribution?
Test of Homogeneity	Do two or more groups come from the same population?

Test of Independence

Used to test whether two categorical variables are independent of each other.

Hypotheses

\(H_0: \text{The two variables are independent (no association)}\)

\(H_1: \text{The two variables are not independent (association exists)}\)

\[\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}\]

where the expected frequency for each cell is:

\(E_{ij} = \frac{R_i \times C_j}{N}\)

\(df = (r - 1)(c - 1)\)

where \(r\) = number of rows and \(c\) = number of columns in the contingency table.

Term	Name	Meaning
\(O_i\)	Observed frequency	Actual count recorded in each cell of the contingency table
\(E_i\)	Expected frequency	Count expected under \(H_0\) if the variables were truly independent
\((O_i - E_i)^2\)	Squared deviation	Amplifies large discrepancies between observed and expected counts
\(E_i\) (denominator)	Standardisation	Scales the deviation relative to the expected count, preventing large cells from dominating
\(R_i\)	Row total	Sum of all counts in row \(i\)
\(C_j\)	Column total	Sum of all counts in column \(j\)
\(N\)	Grand total	Total number of observations

Decision rule

If \(\chi^2_{\text{calculated}} > \chi^2_{\text{critical}}\) at significance level \(\alpha\) → Reject \(H_0\)
If \(\chi^2_{\text{calculated}} \leq \chi^2_{\text{critical}}\) → Fail to reject \(H_0\)
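
A minimal sketch of the test of independence for a 2 × 2 table, following the expected-frequency formula and the decision rule above. The observed counts are hypothetical; the critical value 3.841 is the chi-square critical value for df = 1 at α = 0.05.

```python
# Hypothetical 2 x 2 contingency table of observed counts
observed = [
    [20, 30],   # row 1
    [25, 25],   # row 2
]

row_totals = [sum(row) for row in observed]          # R_i
col_totals = [sum(col) for col in zip(*observed)]    # C_j
grand_total = sum(row_totals)                        # N

chi_square = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand_total   # E_ij = R_i * C_j / N
        chi_square += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(observed[0]) - 1)    # (r - 1)(c - 1)
critical_value = 3.841                               # df = 1, alpha = 0.05
reject_h0 = chi_square > critical_value

print(f"chi-square = {chi_square:.4f}, df = {df}, reject H0: {reject_h0}")
```

For tables with more rows or columns the same loop applies, but the critical value must be looked up for the corresponding degrees of freedom.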

Assumptions

Observations are independent of each other.
Each observation falls into exactly one cell.
Expected frequency in each cell is \(\geq 5\) (if violated, consider Fisher’s Exact Test).
Data are counts (frequencies), not proportions or percentages.

Test of Goodness-of-Fit

Used to determine whether observed frequencies match a theoretically expected distribution.

Formula

\[\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}\]

\[df = k - 1\]

where \(k\) is the number of categories.

Test of Homogeneity

Used to determine whether two or more independent groups share the same distribution of a categorical variable.

The Formula is the same as the Test of Independence:

\[\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}\]

\(df = (r - 1)(c - 1)\)

Note: The key distinction from the Test of Independence is the study design - in homogeneity testing, group membership (rows) is fixed by the researcher, whereas in independence testing, both variables are observed freely.

Comparison of the three Chi-Square tests

Feature	Independence	Goodness-of-Fit	Homogeneity
Variables	2 categorical	1 categorical	1 categorical across groups
Data structure	Contingency table	Single frequency table	Contingency table
Sampling	One sample	One sample	Multiple independent samples
Question	Association?	Matches distribution?	Same distribution across groups?
df	\((r-1)(c-1)\)	\(k - 1\)	\((r-1)(c-1)\)

Example for Test of Independence

To test whether two categorical variables, Gender and Education level, are independent.

Test of Independence (continued)

The chi-square value is calculated via:

Research Question: Is umbrella use dependent on gender?

Note

The calculated chi-square value is smaller than the critical value of 3.841 (df = 1, α = 0.05), so the null hypothesis is not rejected. Men and women do not differ significantly regarding umbrella use.

CORRELATION ANALYSIS

Statistical Methods for Testing Correlations

What is Correlation?
Correlation analysis is a statistical method used to examine the relationship between two continuous or ordinal variables.
It measures:
- The direction of the relationship (positive or negative)
- The strength of the relationship (weak or strong)
The measure of this relationship is the correlation coefficient, which ranges from -1 to +1.

Correlation vs Causation

Correlation shows association, not causation.
- A strong correlation does not prove that changes in one variable cause changes in the other.
Example: Childhood speech and school success:
- There may be a correlation, but correlation alone doesn’t prove that speaking earlier causes better school success.

Correlation Type	Meaning	Example
Positive (+)	As one variable increases, the other also increases.	Height and shoe size
Negative (-)	As one variable increases, the other decreases.	Price and sales volume
No Correlation (0)	No linear relationship between variables.	Random variables

Pearson Correlation Analysis

The Pearson correlation coefficient measures the linear relationship between two continuous (interval/ratio) variables.

Step 1: Covariance

Covariance measures how two variables change together.
- Positive covariance → Positive relationship
- Negative covariance → Negative relationship
Covariance formula:

\[Cov(x,y) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n-1}\]
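A short sketch of the covariance calculation, using invented paired measurements (the height and shoe-size values are hypothetical):

```python
# Hypothetical paired measurements (e.g. height in cm and shoe size)
x = [160, 165, 170, 175, 180]
y = [37, 38, 40, 42, 43]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sample covariance: sum of cross-products of deviations, divided by n - 1
cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)
print(f"cov(x, y) = {cov_xy:.2f}")
```

The result is positive, but its size depends on the measurement units, which is why the covariance is standardized in the next step.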

Pearson Correlation coefficient

Step 2: Correlation

Covariance is not standardized, making it hard to compare across different datasets.
So, we normalize it to get the correlation coefficient:

\[r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \, \sum_{i=1}^{n}(y_i - \bar{y})^2}}\]

This is the Pearson correlation coefficient, which takes values between -1 and +1.
Before calculating the Pearson correlation, it’s important to visually check the relationship:
- Scatterplots help detect if the relationship is linear or non-linear.
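The Pearson coefficient can be computed directly from the deviations around the means. The sketch below uses hypothetical data; the function name `pearson_r` is an illustrative choice, not part of any library.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: cross-products over the root of the two sums of squares."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired measurements (e.g. height in cm and shoe size)
x = [160, 165, 170, 175, 180]
y = [37, 38, 40, 42, 43]

r = pearson_r(x, y)
print(f"r = {r:.3f}")
```

Because the numerator and denominator are in the same units, \(r\) is unit-free and stays within -1 and +1.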

Pearson correlation: linearity

Pearson correlation only captures linear relationships. If the data form a curve or other non-linear pattern, Pearson’s r may not be appropriate.

Note

If these conditions are not met (for example, the relationship is monotonic but not linear), the Spearman rank correlation is used instead.
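To illustrate the difference, the sketch below computes Spearman's coefficient as the Pearson correlation of the ranks. The data are hypothetical (a perfectly monotonic but curved relationship), and the helper names are illustrative.

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Rank from 1 (smallest); tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

# Hypothetical data: y grows with x, but along a curve (y = x cubed)
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]

pearson_r = pearson(x, y)                  # below 1 because the pattern is curved
spearman_r = pearson(ranks(x), ranks(y))   # 1.0 because the order is preserved
print(f"Pearson = {pearson_r:.3f}, Spearman = {spearman_r:.3f}")
```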

REGRESSION ANALYSIS

What is Regression?

Regression is a statistical method to model the relationship between a dependent variable and one or more independent variables.

It is used to measure the influence of predictors and predict outcomes.

Example: Predicting a person’s salary based on education level, weekly working hours, and age.

Variables:

Dependent variable: the variable being predicted (e.g., salary).
Independent variables: the variables used for prediction (e.g., education, working hours, age).

Types of Regression

Regression helps understand how the dependent variable changes when one unit of an independent variable changes, holding others constant.

Type	Description
Simple Linear Regression	One independent variable
Multiple Linear Regression	Two or more independent variables
Logistic Regression	Predicts categorical outcomes (e.g., yes/no decisions)

Simple vs. Multiple Regression

Simple Regression: Use when one independent variable (IV) predicts the dependent variable (DV).
- Example: Do working hours (IV) affect a person’s income (DV)?
Multiple Regression: Use when two or more independent variables predict the DV.
- Example: Do work hours (IV₁), age (IV₂), and education (IV₃) affect a person’s income (DV)?

Simple Linear Regression

Predicts the value of a dependent variable (DV) based on one independent variable (IV).
The stronger the linear relationship between IV and DV, the more accurate the prediction.
A higher proportion of explained variance in the DV leads to better prediction quality.
A scatter plot can illustrate the relationship — when the relationship is strong, data points cluster closely along a straight line.

Method of Least Squares

Linear regression uses the Method of Least Squares to find the best-fit line:

\[\hat{y} = b \cdot x + a\]

Term	Name	Meaning
\(\hat{y}\)	Estimated dependent variable	Predicted \(y\)-value for each \(x\)-value
\(x\)	Independent variable	The predictor input
\(b\)	Slope	How much \(y\) changes when \(x\) increases by one unit
\(a\)	Intercept	Where the line crosses the \(y\)-axis (value of \(\hat{y}\) when \(x = 0\))

Concept of Residual

The residual (error) is the difference between the actual and predicted \(y\)-values:

\[e = y - \hat{y}\]

Goal: Minimize the sum of squared residuals (Ordinary Least Squares — OLS).
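For a single predictor, the least-squares slope and intercept have a closed form: \(b = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}\) and \(a = \bar{y} - b\bar{x}\). A minimal sketch with hypothetical data:

```python
# Hypothetical data: x = predictor, y = outcome
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Slope: cross-products of deviations divided by the squared deviations of x
b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
    (xi - mean_x) ** 2 for xi in x
)
# Intercept: the fitted line passes through the point of means
a = mean_y - b * mean_x

# Residuals e = y - y_hat and the quantity that least squares minimizes
y_hat = [a + b * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, y_hat)]
sse = sum(e ** 2 for e in residuals)

print(f"slope b = {b:.2f}, intercept a = {a:.2f}, sum of squared residuals = {sse:.2f}")
```

Any other choice of slope and intercept would give a larger sum of squared residuals for these data.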

Interpretation of Slope (\(b\))

\(b > 0\): Positive relationship (as \(x\) increases, \(y\) increases)
\(b < 0\): Negative relationship (as \(x\) increases, \(y\) decreases)
\(b = 0\): No linear relationship between \(x\) and \(y\)

Multiple Linear Regression

In multiple linear regression, more than one independent variable is used to predict a single dependent variable. It allows for more accurate and complex prediction by considering multiple influencing factors.

Example

Investigating cholesterol levels in patients:
- Independent variables: age, hours of exercise per week, dietary habits, etc.
- Dependent variable: cholesterol level.

Interpretation

An increase in an independent variable \(x_i\) by one unit will change the dependent variable \(y\) by \(b_i\) units, holding all other variables constant.

Coefficient of Determination (R²)

R² (also known as variance explained) shows how much of the variance in the dependent variable can be explained by the independent variables in the model.

\[R^2 = \frac{S^2_{\hat{y}}}{S^2_{y}} = \frac{\text{Variance of the Predicted values}}{\text{Variance of the Observed values}}\]

\(R^2 = 1\): Perfect fit — all variance explained.
\(R^2 = 0\): No variance explained — independent variables do not help predict the dependent variable.

The higher the \(R^2\), the better the regression model fits the data.
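The variance-ratio definition can be checked numerically. The sketch below fits a simple regression to hypothetical data and computes \(R^2\) in two equivalent ways; the helper `sample_variance` is an illustrative name.

```python
def sample_variance(values):
    """Sample variance with n - 1 in the denominator."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

# Hypothetical data: x = predictor, y = outcome
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# Least-squares fit (slope b, intercept a)
mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
    (xi - mean_x) ** 2 for xi in x
)
a = mean_y - b * mean_x
y_hat = [a + b * xi for xi in x]

# R^2 as variance of predicted values over variance of observed values
r2_variance = sample_variance(y_hat) / sample_variance(y)

# Equivalent form: 1 - (residual sum of squares / total sum of squares)
ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
ss_tot = sum((yi - mean_y) ** 2 for yi in y)
r2_residual = 1 - ss_res / ss_tot

print(f"R^2 = {r2_variance:.2f} (variance ratio), {r2_residual:.2f} (1 - SSres/SStot)")
```

The two forms agree for a least-squares fit that includes an intercept.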

Adjusted R²

R² increases as more independent variables are added to the model, even if those variables don’t contribute meaningfully. Adjusted R² compensates for this by penalising for the number of predictors.

\[R^2_{adj} = 1 - \left(1 - R^2\right)\cdot\frac{n-1}{n-p-1}\]

Term	Name	Meaning
\((1 - R^2)\)	Unexplained variance	Fraction of variance in the outcome not captured by the model
\((n - 1)\)	Total degrees of freedom	Anchors the baseline variability of the dataset
\((n - p - 1)\)	Residual degrees of freedom	Shrinks as \(p\) grows — this is the penalty term

Why the penalty works

When a new predictor is added, \((n - p - 1)\) decreases. Two scenarios arise:

Useful predictor: \((1 - R^2)\) drops meaningfully, offsetting the smaller denominator \(\Rightarrow\) \(R^2_{adj}\) increases or stays stable.
Useless predictor: \((1 - R^2)\) barely changes, but the ratio \(\frac{n-1}{n-p-1}\) grows \(\Rightarrow\) \(R^2_{adj}\) falls, signalling the predictor was not worth including.

Key properties

\(R^2_{adj} \leq R^2\) always, with equality only when \(p = 0\) or \(R^2 = 1\).
\(R^2_{adj}\) can be negative when the predictors explain very little variance relative to their number, specifically when \(R^2 < p/(n-1)\).
As \(n \to \infty\), the penalty vanishes and \(R^2_{adj} \approx R^2\), because large samples make overfitting less of a concern.
Use \(R^2_{adj}\) when comparing models with different numbers of predictors — it provides a fair, penalised basis for comparison.
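The formula is simple enough to sketch as a small function. The first call uses the values from this training's weight example (\(R^2 = 0.754\), \(n = 10\), \(p = 3\)); the second uses an invented weak model to show a negative result. The function name is illustrative.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a model with p predictors fitted to n observations."""
    if n - p - 1 <= 0:
        raise ValueError("need n > p + 1")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Weight example from this training: R^2 = 0.754, n = 10, p = 3
adj = adjusted_r2(0.754, n=10, p=3)
print(f"adjusted R^2 = {adj:.3f}")

# Hypothetical weak model: R^2 = 0.10 is below p / (n - 1) = 1/3
weak = adjusted_r2(0.10, n=10, p=3)
print(f"adjusted R^2 for the weak model = {weak:.2f}")
```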

Assumptions of Linear Regression

For the results of regression analysis to be valid, the following assumptions must be met:

Assumption	Description
Linearity	There must be a linear relationship between the dependent and independent variables
Homoscedasticity	The variance of residuals must be constant across all levels of the independent variable(s)
Normality	The errors (residuals) must be normally distributed
No Multicollinearity	Independent variables should not be highly correlated with each other
No Autocorrelation	Residuals should not show patterns or correlations across observations

Linearity

Linear regression assumes a linear relationship between the dependent and independent variables. The goal is to draw a straight line that best represents the data points.

Left Graph: A linear relationship is visible - data points align closely to the straight line, meaning a regression model will work effectively.
Right Graph: A non-linear relationship is visible - a straight line cannot accurately represent the data, which may lead to incorrect predictions and conclusions.

Consequences of Non-Linearity

Non-linearity can produce invalid regression coefficients and misleading predictions, leading to substantial errors and poor decision-making.

Homoscedasticity

In a regression model, there is always error (residuals) in predicting the dependent variable. Homoscedasticity means that the variance of the residuals is constant across all predicted values.

To check for homoscedasticity, plot the predicted (fitted) values on the \(x\)-axis against the residuals (errors) on the \(y\)-axis.

If homoscedasticity exists → residuals scatter evenly across all predicted values.
If heteroscedasticity exists → the spread of the residuals changes across the range of predicted values (for example, a funnel shape).

Heteroscedasticity makes standard errors, confidence intervals, and p-values unreliable, which may lead to incorrect conclusions.

Example

Weight	Height	Age	Gender
79	1.80	35	Male
69	1.68	39	Male
73	1.82	25	Male
95	1.70	60	Male
82	1.87	27	Male
55	1.55	18	Female
69	1.50	89	Female
71	1.78	42	Female
64	1.67	16	Female
69	1.64	52	Female

The aim is to predict body weight.
The dependent variable is body weight.
The independent variables are body height, age, and gender.

Results

Interpretation of Results

Model Summary

\(R^2 = 75.4\%\) → 75.4% of the variation in weight is explained by the independent variables (height, age, and gender).
\(R^2_{adj} = 63\%\) → After adjusting for the number of predictors and the sample size, about 63% of the variance is explained by the model. Adjusted R² provides a more realistic measure of model fit because it penalises variables that don’t improve the model.
Residual standard error (the typical size of a prediction error) = 6.587 kg.

Regression Equation

\[\text{Weight} = 47.379 \times \text{Height} + 0.297 \times \text{Age} + 8.922 \times \text{is\_male} - 24.41\]

Interpreting Coefficients

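Height: Each additional metre of height is associated with a 47.379 kg higher weight (about 0.47 kg per centimetre), holding other variables constant.
Age: Each additional year is associated with a 0.297 kg higher weight, holding other variables constant.
Gender (is_male): Males are predicted to weigh 8.922 kg more than females of the same height and age.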
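To make the equation concrete, the sketch below plugs the first row of the example table into the reported regression equation. The function name `predict_weight` is illustrative; the coefficients are those shown in the equation above.

```python
def predict_weight(height_m, age_years, is_male):
    """Predicted weight (kg) from the reported regression equation."""
    return 47.379 * height_m + 0.297 * age_years + 8.922 * is_male - 24.41

# First row of the example table: weight 79 kg, height 1.80 m, age 35, male
predicted = predict_weight(1.80, 35, 1)
residual = 79 - predicted

# Holding height and age fixed, the male/female gap equals the is_male coefficient
gap = predict_weight(1.80, 35, 1) - predict_weight(1.80, 35, 0)

print(f"predicted = {predicted:.2f} kg, residual = {residual:.2f} kg, gap = {gap:.3f} kg")
```

The residual of roughly -1.2 kg for this person is well within the residual standard error of 6.587 kg.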

Hypothesis Testing

\(H_0\): Coefficient \(= 0\) (no effect)
\(H_1\): Coefficient \(\neq 0\) (has effect)

In this model, only Age has \(p < 0.05\), making it the only statistically significant predictor.

SURVIVAL ANALYSIS

What is Survival Analysis?

Survival analysis is a statistical method used to examine the time until a specific event occurs, such as death, disease onset, relapse, recovery, or equipment failure.
It focuses on time-based variables, measuring the duration between a start event and an end event.
Time is typically recorded in days, weeks, or months.

Key Components

Start time: The point when observation begins (e.g., diagnosis date).
Event: The point when the outcome occurs (e.g., death, failure).
Survival time: The time between the start and the occurrence of the event.

Examples:
- Time to relapse after withdrawal: Start = End of withdrawal process, Event = Relapse, Survival time = Number of days or weeks until relapse
- Time from disease diagnosis to death: Start = Diagnosis date, Event = Death

What is Censoring?

Censoring occurs when:
The event of interest has not happened by the study’s end.
The subject leaves the study before the event occurs.

Censoring (continued)

Ignoring censoring leads to biased results and incorrect survival estimates.
Censoring is a natural and common part of survival studies and a critical feature that distinguishes survival analysis from other types of statistical modeling.
Imagine you are a dental technician analyzing the lifespan of tooth fillings:
- Start time: Day the filling is placed.
- Event: Filling breaks or falls out.
You record the survival time for each patient’s filling. However, two important scenarios arise:
A patient’s filling has not failed by the end of the study.
A patient drops out (moves away, changes dentist) before the filling fails.
These situations introduce censoring.

Basic Concepts of Survival analysis

Survival time analysis uses specialized statistical methods designed to handle time-to-event data and account for censoring.

The three most common methods are:

Kaplan-Meier Survival Curves
Log-Rank Test
Cox Proportional Hazards Regression

Kaplan-Meier Curve

The Kaplan-Meier curve is one of the most widely used methods for estimating survival functions.
It answers a key question: What is the probability that the event of interest has not occurred by a certain time point?
Kaplan-Meier Plot Axes: X-axis: Time (e.g., days, weeks, months), Y-axis: Probability of survival (from 1 or 100% down to 0)
The Kaplan-Meier curve would display the probability that a filling remains intact over time.
With the curve, you can answer questions like:
- What proportion of fillings last at least 5 years?
- How rapidly does the failure rate increase after insertion?
- At what point have 50% of the fillings failed (the median survival time)?
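The Kaplan-Meier estimate multiplies, at each event time, the fraction of subjects still at risk who do not experience the event. A minimal sketch, assuming hypothetical filling lifetimes in years (event flag 1 = filling failed, 0 = censored); the function name is illustrative:

```python
def kaplan_meier(durations, events):
    """Return [(time, survival probability)] at each distinct event time."""
    survival = 1.0
    curve = []
    for t in sorted(set(d for d, e in zip(durations, events) if e == 1)):
        # Subjects still under observation just before time t
        at_risk = sum(1 for d in durations if d >= t)
        # Events (not censorings) occurring exactly at time t
        failed = sum(1 for d, e in zip(durations, events) if d == t and e == 1)
        survival *= (at_risk - failed) / at_risk
        curve.append((t, survival))
    return curve

# Hypothetical data for 8 fillings: years observed and whether the filling failed
durations = [2, 3, 3, 5, 6, 8, 8, 10]
events = [1, 1, 0, 1, 0, 1, 1, 0]

curve = kaplan_meier(durations, events)
for t, s in curve:
    print(f"year {t}: estimated survival = {s:.3f}")

# Median survival time: first time the curve reaches 50% or lower
median_time = next((t for t, s in curve if s <= 0.5), None)
```

Censored fillings never count as failures, but they remain in the at-risk group up to the time they were last observed, which is how the method uses their partial information.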

Comparing different groups

When studying survival time, researchers often want to compare two or more groups. For example:
- Comparing patients receiving two different treatments.
- Comparing male vs. female patients.
- Comparing different age groups or exposure levels.
In such cases, the Kaplan-Meier curve is drawn separately for each group. Each line on the plot represents the estimated survival rate over time for a particular group.

Comparing different groups (continued)

Visual comparison of the survival curves can suggest differences, for example, one group may show faster failure rates than another.
- However, visual inspection alone is not enough.
We need a formal statistical test, the Log-Rank Test, to check whether these differences are statistically significant.
The Log-Rank Test is a statistical test used to compare survival distributions between two or more independent groups.
It answers the question: Is there a statistically significant difference in the time-to-event between the groups?
The Log-Rank test is based on comparing the observed number of events in each group with the expected number of events under the assumption that the groups have the same survival experience.
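The comparison of observed and expected events can be sketched for two groups as follows. This is an illustrative implementation of the standard two-group log-rank statistic with hypothetical data, not the output of any particular software package.

```python
def logrank_statistic(times_a, events_a, times_b, events_b):
    """Log-rank chi-square statistic (df = 1) for two groups."""
    all_times = list(times_a) + list(times_b)
    all_events = list(events_a) + list(events_b)
    event_times = sorted({t for t, e in zip(all_times, all_events) if e == 1})

    observed_a = expected_a = variance = 0.0
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)  # at risk in group A
        n_b = sum(1 for x in times_b if x >= t)  # at risk in group B
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e == 1)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        # Expected events in A if both groups share the same survival experience
        expected_a += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    return (observed_a - expected_a) ** 2 / variance

# Hypothetical survival times (all events observed, no censoring)
times_a, events_a = [1, 2, 3, 4], [1, 1, 1, 1]
times_b, events_b = [5, 6, 7, 8], [1, 1, 1, 1]

chi2 = logrank_statistic(times_a, events_a, times_b, events_b)
# 3.841 is the chi-square critical value for df = 1 at alpha = 0.05
significant = chi2 > 3.841
print(f"log-rank chi-square = {chi2:.2f}, significant at 5%: {significant}")
```

In this invented example every event in group A happens before any event in group B, so the statistic exceeds 3.841 even with only four subjects per group.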

Hypotheses in the Log-Rank Test

Null Hypothesis (H₀): The survival distributions of the groups are identical. (There is no difference in survival times between groups.)
Alternative Hypothesis (H₁): The survival distributions of the groups are different. (There is a difference in survival times between groups.)

Result

Interpreting the Log-Rank Test

If the p-value is small (typically < 0.05):
- Reject the null hypothesis.
- Conclude that there is a statistically significant difference between the groups’ survival experiences.
If the p-value is large (typically ≥ 0.05):
- Fail to reject the null hypothesis.
- Conclude that there is not enough evidence of a difference between the groups.

Cox Regression

What is Cox Regression?
Cox Regression (also called the Cox Proportional Hazards Model) is a method used in survival analysis to:
- Examine the influence of several variables (continuous, binary, or categorical) on survival time.
- Adjust for multiple variables at the same time.
- Predict how changes in the variables affect the risk (hazard) of the event occurring.
In short, Cox regression allows us:
- to determine the effects of multiple independent variables on a time-to-event outcome,
- to test hypotheses about which factors impact survival,
- to build predictive models based on those factors.

Cox Regression Model (Cox Proportional Hazards Model)

The Cox model describes the hazard (risk) of an event over time:

\[H(t) = H_0(t) \times \exp\left(b_1 x_1 + b_2 x_2 + \cdots + b_k x_k\right)\]

\(H_0(t)\): Baseline hazard when all predictors are zero.
\(x_1,\ldots,x_k\): Predictor variables, \(b_1,\ldots,b_k\): Regression coefficients.
The exponential of a coefficient exp(b) gives the Hazard Ratio (HR):
- For binary variables (e.g., exposed vs. unexposed), HR shows how many times higher (or lower) the hazard of the event is in the exposed group than in the unexposed group.
- For continuous variables (e.g., age), HR shows the change in hazard per one-unit increase:
  - Example: HR = 1.03 for age → each year increases the hazard by 3%.
  - Example: HR = 0.85 for albumin → each 1 g/dL increase reduces the hazard by 15%.
Interpretation assumes other covariates are held constant.
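Because the model is multiplicative on the hazard scale, effects over several units compound rather than add. A small sketch using the age example above (HR = 1.03 per year); the function name is illustrative:

```python
import math

def hazard_ratio(coefficient, units=1.0):
    """Hazard ratio for an increase of `units` in a predictor."""
    return math.exp(coefficient * units)

# Coefficient whose hazard ratio is 1.03 per year of age (the example above)
b_age = math.log(1.03)

hr_1_year = hazard_ratio(b_age)        # 1.03, i.e. hazard up by 3% per year
hr_10_years = hazard_ratio(b_age, 10)  # effects multiply: 1.03 ** 10

percent_change = (hr_10_years - 1) * 100
print(f"HR per year = {hr_1_year:.2f}, HR per 10 years = {hr_10_years:.3f} "
      f"(+{percent_change:.1f}%)")
```

A 10-year age difference therefore corresponds to a hazard about 34% higher, not 30%.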

Cox Regression: data

Let’s assume that we have the following data and we want to evaluate them.

Cox Regression: results

The following Cox regression results were generated from the data above.

Interpretation

Coefficient (β): Reflects the direction and strength of the variable’s association with survival.
- Negative β → Lower risk (longer survival).
- Positive β → Higher risk (shorter survival).
P-value: Tests whether the coefficient is significantly different from zero.

Hypothesis Testing in Cox Regression

Null Hypothesis (H₀): The coefficient is zero (no effect on survival).
Alternative Hypothesis (H₁): The coefficient is not zero (there is an effect on survival).
Decision rule:
- If p-value < 0.05 → Reject H₀ → Variable has a significant effect.
- If p-value ≥ 0.05 → Fail to reject H₀ → No evidence of an effect.

Assumptions of Cox Regression

Proportional Hazards Assumption

The hazard ratio between any two individuals is constant over time.
- The effect of a predictor variable on survival is the same at all times.
If this assumption is violated, the model may produce biased results.
Note: Graphical methods (e.g., Schoenfeld residuals) can be used to check this assumption.

Independence Assumption

The survival times of individuals are independent of each other, given their predictor values.
- One person’s survival time should not affect another’s.
If survival times are correlated (e.g., clustered within hospitals), special methods like frailty models may be needed.

Linearity Assumption

The relationship between continuous predictor variables and the log hazard is linear.
This means that for each unit increase in the predictor, the effect on the log of the hazard is constant.
If the relationship is not linear, transformations or splines can be used.

Assumption	Description	How to Check
Proportional Hazards	The hazard ratio between individuals is constant over time	Schoenfeld residuals plot
Independence	Survival times of individuals are independent of each other	Study design; use frailty models if clustered
Linearity	Continuous predictors have a linear relationship with log hazard	Martingale residuals; splines

BASIC CONCEPTS OF LONGITUDINAL ANALYSIS

Basic concepts of Longitudinal Data analysis

In survey research and data collection, two major study designs are:
- Longitudinal studies
- Cross-sectional studies
Both are widely used across health, social science, and humanitarian fields.
Knowing their differences is crucial for designing effective research and managing data collection workflows.

Key Differences

Feature	Longitudinal Study	Cross-Sectional Study
Timing	Over multiple time points	At a single point in time
Objective	Track changes or trends	Describe current status
Subjects	Same individuals followed	Different individuals sampled
Strengths	Establishes temporal order; supports causal inference	Quick, cost-effective

When to Use a Longitudinal Study

When the research question involves changes, trends, or trajectories over time.
When exploring the effect of time-dependent exposures.
When within-subject comparisons are critical.
When aiming to establish temporal sequences for potential causal inferences.

Example: Tracking changes in maternal nutritional status from early pregnancy to postpartum to examine impacts on birth outcomes.

Key Features of Longitudinal Data

Statistical techniques like ANOVA and regression usually assume independent and identically distributed (iid) residuals.
In practice, data often violate this assumption due to correlations among observations.
Correlated data structures include:
- Clustered data, Repeated measurements, Spatially correlated data

Examples for clustered data: Families, Schools, Hospitals, Towns.

When repeated measurements are collected on the same individuals over time, the resulting data form a longitudinal (or panel) study.
- This allows researchers to examine how individuals change over time and what factors influence those changes.
Because repeated measurements on the same individual are correlated, such data violate the iid assumption.

Objectives of Longitudinal Studies

Characterize how a response changes over time.
Identify factors that influence these changes.
Key Advantages:
Distinguish within-individual changes
- (e.g., how a person’s condition evolves, as measured by test scores).
Separate between-individual differences
- (e.g., how people differ overall).

Exploratory Data Analysis (EDA) in Longitudinal Studies

What is Exploratory Data Analysis (EDA)?
Exploratory analysis comprises techniques to visualize patterns in the data.
Data analysis must begin by making displays that expose patterns relevant to the scientific question.
EDA therefore:
- Helps uncover expected and unexpected patterns.
- Uses graphical displays to highlight relationships and trends.
- Indicates which model will be appropriate for analysis.

Example: Jimma Infant Survival Data

A follow-up study of newborn infants in Southwest Ethiopia with measurements at 7 time points (every 2 months, 0-12 months).

Variable	Description	Type
`ind`	Infant ID	Numeric (ID)
`sex`	Sex of the infant	Categorical
`place`	Place of residence	Categorical (1=Urban, 2=Rural)
`weight`	Weight (grams)	Numeric
`length`	Length/height (cm)	Numeric
`Bf`	Breastfeeding status	Binary (1=Yes, 0=No)
`age`	Age (months: 0,2,4,6,8,10,12)	Numeric
`BMIBIN`	BMI category	Binary (1=Normal, 0=Other)

Study Overview

Follow-up period: 12 months, with seven time points per child
- Measurements taken every two months
Weight recorded at each visit
Research Question: How does weight change over time?

Long Format vs. Wide Format

Long Format: one row per observation per time point:

ind	sex	place	weight	length	Bf	age	BMIBIN
1	male	2	3000	50	1	0	1
1	male	2	5200	59	1	2	1
1	male	2	5900	65	1	4	1

Wide Format: one row per individual:

ind	sex	place	weight_age0	weight_age2	weight_age4	length0	length2	length4
1	male	2	3000	5200	5900	50	59	65
2	female	2	3900	5500	6500	60	55	61

Most longitudinal models require long format.
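Converting from wide to long format amounts to emitting one row per individual per time point. A minimal sketch using the first wide-format row shown above and plain Python dictionaries (a data-frame library would normally do this in one call):

```python
# First row of the wide-format table above
wide_rows = [
    {"ind": 1, "sex": "male", "place": 2,
     "weight_age0": 3000, "weight_age2": 5200, "weight_age4": 5900,
     "length0": 50, "length2": 59, "length4": 65},
]

ages = [0, 2, 4]  # measurement occasions present in the wide columns

long_rows = []
for row in wide_rows:
    for age in ages:
        # One output row per individual per time point
        long_rows.append({
            "ind": row["ind"],
            "sex": row["sex"],
            "place": row["place"],
            "weight": row[f"weight_age{age}"],
            "length": row[f"length{age}"],
            "age": age,
        })

for r in long_rows:
    print(r)
```

Time-invariant variables (`ind`, `sex`, `place`) are repeated on every row, while `weight` and `length` vary with `age`.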

Individual vs. Mean Profiles

Conclusions from the profile:

Much variability between children
Fixed number of measurements per subject

Considerable variability within subjects
Measurements taken at fixed time points

Exploratory analysis conclusion

Conclusion - From the exploratory analysis:

Mean structure seems linear over time.
Variability between subjects at baseline.
Variability between subjects in the way they evolve.
Hence, a linear mean with a random intercept and a random slope is a reasonable choice.

Exploring the random effects

Choosing Random Effects:
- Decide which parameters need random effects to account for between-group variation.

Covariance Structure:
- The pairs method helps explore the covariance structure among random effects.

Linear Mixed Model

Correctly modeling correlation is essential for valid inference about regression coefficients.
Extend general linear models to account for correlated errors.
Combine:
- Fixed effects (e.g., sex, age group)
- Random effects (e.g., individual subjects)
LMMs make assumptions about:
Mean structure: linear or nonlinear
Variance function: constant or changing (e.g., quadratic)
Correlation structure: independent, serial, etc.
Subject-specific profiles: linear, quadratic, etc.

The model is given by

\[Y_i = \underbrace{X_i \beta}_{\text{fixed effect}} + \underbrace{Z_i b_i}_{\text{random effect}} + \varepsilon_i\]

Where:

\[b_i \sim N(0, D), \quad \varepsilon_i \sim N(0, \Sigma_i)\]

\[b_1, b_2, \ldots, b_N,\ \varepsilon_1, \varepsilon_2, \ldots, \varepsilon_N \ \text{are independent}\]

Term	Meaning
\(\beta\)	Fixed effects (population-level)
\(b_i\)	Random effects (individual-level)
\(D\)	Covariance of random effects
\(\Sigma_i\)	Residual covariance
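One way to build intuition for the model is to simulate from it. The sketch below generates weights for five infants with a random intercept and a random slope for age. All parameter values are hypothetical, and for simplicity the random intercept and slope are drawn independently (a diagonal \(D\)) with independent residuals.

```python
import random

rng = random.Random(42)  # fixed seed so the simulation is reproducible

ages = [0, 2, 4, 6, 8, 10, 12]     # months, as in the study design
beta0, beta1 = 3200.0, 430.0       # hypothetical fixed intercept and slope (grams)
sd_intercept, sd_slope, sd_resid = 300.0, 50.0, 400.0  # hypothetical SDs

rows = []
for subject in range(1, 6):
    # Subject-specific deviations b_i from the population intercept and slope
    b0 = rng.gauss(0, sd_intercept)
    b1 = rng.gauss(0, sd_slope)
    for age in ages:
        eps = rng.gauss(0, sd_resid)  # residual error for this measurement
        weight = (beta0 + b0) + (beta1 + b1) * age + eps
        rows.append({"ind": subject, "age": age, "weight": weight})

print(f"{len(rows)} rows in long format, e.g. {rows[0]}")
```

Measurements from the same infant share the same `b0` and `b1`, which is what induces the within-subject correlation that the mixed model accounts for.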

LMM result for Jimma infant data

Fixed Effect	\(\beta\)	95% CI	p-value
Age	433	389, 477	< 0.001
Sex - Female (ref: Male)	282	−273, 837	0.3
Breastfed (ref: Not breastfed)	244	−314, 802	0.4
Random Effect
Random intercept	345.302
Random slope of time	54.087
Residual	656.924
ICC	34.5%

Note

Age is the only statistically significant predictor: each additional month of age is associated with an average weight gain of approximately 433 grams, adjusting for sex and breastfeeding status.

Intraclass Correlation Coefficient (ICC)

The ICC quantifies what proportion of the total variability is attributable to between-individual differences.

\[ICC = \frac{\sigma^2_{\text{between}}}{\sigma^2_{\text{between}} + \sigma^2_{\text{within}}} = \frac{345.302}{345.302 + 656.924} \approx 0.345\]

Source of Variability	Proportion
Between-individual differences	34.5%
Within-individual changes over time	65.5%

An ICC of 34.5% indicates that about one third of the total variability in weight is due to stable differences between infants.
This confirms that ignoring random effects would lead to invalid standard errors and misleading inference.
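The ICC arithmetic can be reproduced directly from the variance components in the results table (random intercept and residual); the function name is illustrative.

```python
def icc(var_between, var_within):
    """Share of total variance attributable to between-individual differences."""
    return var_between / (var_between + var_within)

# Variance components reported in the LMM results above
value = icc(345.302, 656.924)
print(f"ICC = {value:.3f}; within-individual share = {1 - value:.3f}")
```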

Thank you!

Questions & discussion are welcome

C4ED | Center for Evaluation and Development
Day 1 · Basic Quantitative Methods
yebelay.ma@gmail.com

Descriptive statistics Hypothesis testing Regression Survival analysis Longitudinal data
